BigQuery Commands

Google Cloud Datalab provides a first-class experience for working with Google BigQuery, both for ad hoc, exploratory work and for pipeline development. In addition to authoring and executing SQL queries, you'll likely want to inspect datasets, tables, and their schemas, and sample data.

This notebook introduces various BigQuery commands that Datalab brings into the notebook environment.

You've already seen a BigQuery command in the Hello BigQuery notebook; these commands are invoked using %%bq.

Commands, literally

You can think of %%bq as your command-line interface to BigQuery. These commands take arguments, including --help (or -h), in the same manner as other shell commands. Let's start by looking at the help.


In [5]:
%bq --help


usage: bq [-h]
          {datasets,tables,query,execute,extract,sample,dryrun,udf,datasource,load}
          ...

Execute various BigQuery-related operations. Use "%bq <command> -h" for help
on a specific command.

positional arguments:
  {datasets,tables,query,execute,extract,sample,dryrun,udf,datasource,load}
                        commands
    datasets            Operations on BigQuery datasets
    tables              Operations on BigQuery tables
    query               Create or execute a BigQuery SQL query object,
                        optionally using other SQL objects, UDFs, or external
                        datasources. If a query name is not specified, the
                        query is executed.
    execute             Execute a BigQuery SQL query and optionally send the
                        results to a named table. The cell can optionally
                        contain arguments for expanding variables in the
                        query.
    extract             Extract a query or table into file (local or GCS)
    sample              Display a sample of the results of a BigQuery SQL
                        query. The cell can optionally contain arguments for
                        expanding variables in the query, if -q/--query was
                        used, or it can contain SQL for a query.
    dryrun              Execute a dry run of a BigQuery query and display
                        approximate usage statistics
    udf                 Create a named Javascript BigQuery UDF
    datasource          Create a named Javascript BigQuery external data
                        source
    load                Load data from GCS into a BigQuery table. If creating
                        a new table, a schema should be specified in YAML or
                        JSON in the cell body, otherwise the schema is
                        inferred from existing table.

optional arguments:
  -h, --help            show this help message and exit
None

You can get help on any of the individual commands as follows:


In [2]:
%bq tables --help


usage: bq tables [-h] {list,create,describe,delete,view} ...

Operations on BigQuery tables

positional arguments:
  {list,create,describe,delete,view}
    list                List the tables in a BigQuery project or dataset.
    create              Create a table.
    describe            View a table's schema
    delete              Delete a table.
    view                View a table.

optional arguments:
  -h, --help            show this help message and exit
None

Listing BigQuery Resources

The first thing you might want to do is list the datasets and tables contained in a specified project, or in the default project in which this Datalab environment is running.

Note that you will be able to list and use any BigQuery dataset that has been shared with the project (specifically, its service account) in which you are running Datalab.


In [1]:
%bq datasets list --project cloud-datalab-samples


Out[1]:
  • cloud-datalab-samples.appenginelogs
  • cloud-datalab-samples.carprices
  • cloud-datalab-samples.httplogs

In [2]:
%bq tables list --dataset cloud-datalab-samples.httplogs


Out[2]:
  • BigQuery Table - name: cloud-datalab-samples.httplogs.logs_20140615
  • BigQuery Table - name: cloud-datalab-samples.httplogs.logs_20140616
  • BigQuery Table - name: cloud-datalab-samples.httplogs.logs_20140617
  • BigQuery Table - name: cloud-datalab-samples.httplogs.logs_20140618
  • BigQuery Table - name: cloud-datalab-samples.httplogs.logs_20140619
  • BigQuery Table - name: cloud-datalab-samples.httplogs.logs_20140620

Exploring Data

Once you've found a table or set of tables to work with, it's simple to retrieve the schema or a sample of rows to understand the data you're working with.

Schema


In [3]:
%bq tables describe --name cloud-datalab-samples.httplogs.logs_20140615


Out[3]:
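The same schema information can also be retrieved programmatically through the google.datalab.bigquery Python API referenced at the end of this notebook. Here is a minimal sketch; the attribute names on each schema field (name, data_type) are assumptions and may differ slightly depending on your Datalab version.

import google.datalab.bigquery as bq

# Look up the table and iterate over its schema fields.
# Field attribute names (name, data_type) are assumptions; inspect the
# Schema object in your environment if they differ.
table = bq.Table('cloud-datalab-samples.httplogs.logs_20140615')
for field in table.schema:
    print(field.name, field.data_type)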

Sampling Data


In [6]:
%bq sample --help


usage: bq sample [-h] [-q QUERY | -t TABLE | -v VIEW] [-nc] [-b BILLING]
                 [-m {limit,random,hashed,sorted}] [--fields FIELDS]
                 [-c COUNT] [-p PERCENT] [--key-field KEY_FIELD]
                 [-o {ascending,descending}] [-P] [--verbose]

Display a sample of the results of a BigQuery SQL query. The cell can
optionally contain arguments for expanding variables in the query, if
-q/--query was used, or it can contain SQL for a query.

optional arguments:
  -h, --help            show this help message and exit
  -q QUERY, --query QUERY
                        the name of the query object to sample
  -t TABLE, --table TABLE
                        the name of the table object to sample
  -v VIEW, --view VIEW  the name of the view object to sample
  -nc, --nocache        Don't use previously cached results
  -b BILLING, --billing BILLING
                        BigQuery billing tier
  -m {limit,random,hashed,sorted}, --method {limit,random,hashed,sorted}
                        The type of sampling to use
  --fields FIELDS       Comma separated field names for projection
  -c COUNT, --count COUNT
                        The number of rows to limit to, if sampling
  -p PERCENT, --percent PERCENT
                        For random or hashed sampling, what percentage to
                        sample from
  --key-field KEY_FIELD
                        The field to use for sorted or hashed sampling
  -o {ascending,descending}, --order {ascending,descending}
                        The sort order to use for sorted sampling
  -P, --profile         Generate an interactive profile of the data
  --verbose             Show the expanded SQL that is being executed
None

As you can see, Datalab allows you to sample a specified number of rows using various sampling strategies, including random sampling and hashing. Let's try retrieving a couple of samples.


In [7]:
%bq sample --table cloud-datalab-samples.httplogs.logs_20140615 --count 5


Out[7]:
timestamp                   latency  status  method  endpoint
2014-06-15 07:00:00.536486       48     200  GET     Interact3
2014-06-15 07:00:00.003772      122     200  GET     Interact3
2014-06-15 07:00:00.670100      103     200  GET     Interact3
2014-06-15 07:00:00.428897      144     200  GET     Interact3
2014-06-15 07:00:00.652760       28     405  GET     Interact2

(rows: 5, time: 1.4s, 24MB processed, job: job_TFdKfRs19MWQHsxAQNiu28q-4Nc)

You can also specify a list of fields to project.


In [8]:
%bq sample --table cloud-datalab-samples.httplogs.logs_20140615 --count 10 --fields timestamp,latency --method hashed --key-field latency


Out[8]:
timestamp                   latency
2014-06-15 07:00:01.750207       26
2014-06-15 07:00:01.166912       27
2014-06-15 07:00:00.652760       28
2014-06-15 07:00:00.943075       28
2014-06-15 07:00:01.694210       29
2014-06-15 07:00:01.735101       37
2014-06-15 07:00:00.536486       48
2014-06-15 07:00:01.456010      123
2014-06-15 07:00:01.000102      124
2014-06-15 07:00:00.428897      144

(rows: 10, time: 2.7s, 10MB processed, job: job_3w3djhFW_PtOA9BD65YILcu3BG8)
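
The other sampling methods listed in the help above follow the same pattern. For example, a cell like the following (not executed in this notebook; the flags are taken from the sample help output above) would request a random sample of roughly one percent of rows, capped at ten rows:

%bq sample --table cloud-datalab-samples.httplogs.logs_20140615 --method random --percent 1 --count 10 --fields timestamp,latency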

Querying Data

Of course, since querying BigQuery data using SQL is the primary scenario, it's always handy to have the BigQuery SQL reference. Datalab makes this easy to access by providing a direct link to the query reference via the Help Links button on the top toolbar.

Let's look at some more query examples.


In [9]:
%%bq query 
SELECT timestamp, latency
FROM `cloud-datalab-samples.httplogs.logs_20140615`
ORDER BY latency DESC
LIMIT 10


Out[9]:
timestamp                   latency
2014-06-15 10:15:06.354093  13500918
2014-06-15 09:15:04.571911  13500391
2014-06-15 10:00:05.264877  13500340
2014-06-15 09:45:04.873233  13500166
2014-06-15 09:30:07.401665  13500118
2014-06-16 00:01:45.311564   1197339
2014-06-15 15:17:12.987435    136467
2014-06-15 15:16:33.290520    130296
2014-06-15 15:15:56.727494    121277
2014-06-15 15:15:58.259215     97192

(rows: 10, time: 1.5s, 10MB processed, job: job_cUDvTK58OQa1slx8GXYUdZr9oGM)
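
The help output above also lists a dryrun command for estimating how much data a query would process before you actually run it. As a hedged sketch (the exact invocation is an assumption here; run %bq dryrun --help to confirm the supported arguments), the cell body would hold the SQL to check:

%%bq dryrun
-- Assumption: dryrun accepts SQL in the cell body, similar to sample.
SELECT timestamp, latency
FROM `cloud-datalab-samples.httplogs.logs_20140615`
ORDER BY latency DESC
LIMIT 10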

That was a simple query. Now, let's do something a bit more interesting: let's compute the 50th percentile, 95th percentile, and 99th percentile latencies for request processing.

BigQuery makes this effortless with its built-in statistical and analytic functions, such as APPROX_QUANTILES.


In [10]:
%%bq query
SELECT
  quantiles[SAFE_ORDINAL(50)] AS latency_50th_percentile,
  quantiles[SAFE_ORDINAL(95)] AS latency_95th_percentile,
  quantiles[SAFE_ORDINAL(99)] AS latency_99th_percentile
FROM (
  SELECT APPROX_QUANTILES(latency, 100) AS quantiles
  FROM `cloud-datalab-samples.httplogs.logs_20140615`
)


Out[10]:
latency_50th_percentile  latency_95th_percentile  latency_99th_percentile
81                       418                      5047

(rows: 1, time: 4.8s, 5MB processed, job: job_YsxfaYnndcd2RzavBk0Zw7BLXkM)

Visualizing Data

Beyond tables, it is almost always interesting to visualize the data to get a more meaningful view of aggregates, trends, and patterns.

Let's write another query, again using APPROX_QUANTILES. This time, the ROW_NUMBER function is used to include a row number in the output representing the percentile; that number serves as the chart's x-axis value. So that the SQL query can be referenced later in the notebook, it is given a name via the --name/-n argument. The named query object can then be passed to a chart, which executes the query it contains.


In [11]:
%%bq query --name data
WITH quantiles AS (
  SELECT APPROX_QUANTILES(LOG10(latency), 50) AS timearray
  FROM `cloud-datalab-samples.httplogs.logs_20140615`
  WHERE latency <> 0
)
SELECT ROW_NUMBER() OVER (ORDER BY time) AS percentile, time
FROM quantiles CROSS JOIN UNNEST(quantiles.timearray) AS time
ORDER BY percentile

In [12]:
%chart columns --data data --fields percentile,time


Out[12]:

Looking Ahead

There are other commands, such as those that import (load) and export (extract) data or that handle tables and datasets.

Datalab also allows queries to be built up one step at a time into composite SQL queries that use constructs such as user-defined functions (UDFs) and external data sources, so you can harness the full power of BigQuery SQL while managing authoring complexity.

All of these BigQuery commands are implemented on top of the Python BigQuery APIs in the google.datalab.bigquery module. This lets you write arbitrary code and logic while working with BigQuery data, and also lets you combine SQL with Python and with Python data analysis libraries such as pandas and matplotlib to perform sophisticated, custom data analysis and visualization tasks.
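
For example, here is a minimal sketch of running one of the queries above through the Python API and pulling the results into a pandas DataFrame; the execute().result().to_dataframe() method chain is an assumption based on the google.datalab.bigquery module and may vary by Datalab version.

import google.datalab.bigquery as bq

# Define the query, execute it, and convert the results to a pandas DataFrame.
# The execute().result().to_dataframe() chain is an assumption; consult the
# google.datalab.bigquery documentation in your Datalab version.
sql = """
SELECT timestamp, latency
FROM `cloud-datalab-samples.httplogs.logs_20140615`
ORDER BY latency DESC
LIMIT 10
"""
df = bq.Query(sql).execute().result().to_dataframe()
df.describe()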

These topics are covered in other BigQuery tutorial notebooks that are included with Datalab.